Record Linkage Measures in an Entity Centric World

Authors

  • Matthew Michelson
  • Sofus A. Macskassy
Abstract

For unsupervised clustering, traditional accuracy metrics based on the constituent records often do not reflect the accuracy at the cluster level. For a specific example, consider entity resolution, where the goal is to cluster records across multiple, heterogeneous data sources into “entities.” Measuring the accuracy of entity resolution is not as simple as applying the well-known record-level metrics of precision and recall. Rather than using traditional tuple-based metrics for accuracy, we posit that new, entity-based metrics should be defined instead. Defining entity-level metrics gives users a less source-biased, yet deeper, insight into entity resolution performance. We show that traditional record linkage metrics are not appropriate, and offer some early thoughts on entity-centric measurements that are more so.
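
To make the gap between record-level and cluster-level accuracy concrete, here is a minimal sketch in Python. It is not the authors' proposal: the abstract does not define its entity-centric measures, so `exact_entity_fraction` is a hypothetical stand-in (the share of true entities reproduced exactly as one predicted cluster), and the record identifiers are invented toy data. Pairwise precision and recall are computed in the usual way over record pairs placed in the same cluster.

```python
from itertools import combinations


def pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_precision_recall(predicted, truth):
    """Record-level (pairwise) precision and recall of a clustering."""
    pred_pairs, true_pairs = pairs(predicted), pairs(truth)
    tp = len(pred_pairs & true_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(true_pairs) if true_pairs else 1.0
    return precision, recall


def exact_entity_fraction(predicted, truth):
    """Hypothetical entity-level score: fraction of true entities
    that appear exactly as one predicted cluster."""
    pred = {frozenset(c) for c in predicted}
    return sum(frozenset(c) in pred for c in truth) / len(truth)


# Toy data (assumed): one five-record entity and three two-record entities.
truth = [{"a1", "a2", "a3", "a4", "a5"}, {"b1", "b2"}, {"c1", "c2"}, {"d1", "d2"}]
# The resolver recovers the large entity but splits every small one.
predicted = [{"a1", "a2", "a3", "a4", "a5"},
             {"b1"}, {"b2"}, {"c1"}, {"c2"}, {"d1"}, {"d2"}]

precision, recall = pairwise_precision_recall(predicted, truth)
entity_score = exact_entity_fraction(predicted, truth)
print(f"pairwise precision={precision:.2f} recall={recall:.2f}")
print(f"entities resolved exactly={entity_score:.2f}")
```

In this toy case the large entity contributes 10 of the 13 true record pairs, so the pairwise scores stay high even though only one of the four entities is resolved correctly. That is the kind of mismatch the abstract describes, shown under the stated assumptions.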

Related resources

Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find the multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Leveraging Social Media Signals for Record Linkage

Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities. Traditional approaches use the record representation of entities to accomplish this task. With the nascence of social media, entities on the Web are now a...

Regression classifier for Improved Temporal Record Linkage

Temporal record linkage is the process of identifying groups of records which are collected over long periods of time, such as census databases or voter registration databases, that represent the same real-world entities. These datasets often contain temporal information for each record, such as the time when a record was created, or the time when it was modified. Unlike traditional record link...

A note on using the F-measure for evaluating data linkage algorithms

Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide if a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques — including s...

Publication date: 2009